Kafka Connect: Surface commit failures instead of silently swallowing them #16237

Open
yadavay-amzn wants to merge 1 commit into apache:main from yadavay-amzn:fix/iceberg_15878

Conversation

Contributor

@yadavay-amzn yadavay-amzn commented May 7, 2026

Fixes #15878.

Problem

The Kafka Connect Coordinator previously caught Exception around doCommit() and only logged a warning, so when a commit failed (e.g., a CommitFailedException from Glue detecting a concurrent table update), the connector stayed RUNNING while silently dropping the data that was in flight.

Fix

Remove the catch-all around doCommit() and instead log at ERROR level with the task id and commit id before rethrowing. CoordinatorThread.run() already terminates the thread on uncaught exceptions, which transitions the Kafka Connect task to FAILED — so failures are now surfaced rather than dropped.

The finally block that calls commitState.endCurrentCommit() is preserved so per-commit state is cleaned up regardless of the outcome.
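As a rough illustration of the control flow described above, the following sketch lets a commit failure escape while a finally block still resets the per-commit state. The class `CommitPropagationSketch`, the nested `CommitState`, and the `IllegalStateException` standing in for CommitFailedException are all hypothetical stand-ins, not the actual Iceberg classes.

```java
public class CommitPropagationSketch {

  /** Hypothetical stand-in for the per-commit state; not the Iceberg class. */
  static class CommitState {
    boolean inProgress;

    void startNewCommit() {
      inProgress = true;
    }

    void endCurrentCommit() {
      inProgress = false;
    }
  }

  final CommitState commitState = new CommitState();

  /** Simulates a catalog rejecting the commit (assumed failure mode). */
  void doCommit() {
    throw new IllegalStateException("Glue detected concurrent update");
  }

  /** No catch-all here: the failure escapes, but cleanup still runs. */
  void commit() {
    commitState.startNewCommit();
    try {
      doCommit();
    } finally {
      commitState.endCurrentCommit();
    }
  }

  public static void main(String[] args) {
    CommitPropagationSketch coordinator = new CommitPropagationSketch();
    try {
      coordinator.commit();
    } catch (IllegalStateException e) {
      System.out.println("propagated: " + e.getMessage());
    }
    System.out.println("commit in progress: " + coordinator.commitState.inProgress);
  }
}
```

In this sketch the caller sees the exception and the state is reset, which mirrors the combination of propagation and cleanup that the description aims for.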

Testing

  • Added testCommitFailedExceptionPropagates which mocks a catalog-side CommitFailedException on AppendFiles.commit() and asserts it propagates out of Coordinator.process(). Without the fix, this test fails because the exception is swallowed.
  • Updated two existing tests (testCoordinatorWithBadDataFile and testCoordinatorCommittedOffsetValidation) that previously relied on silent-swallow behaviour; they now assert the specific exception propagates (IllegalArgumentException for bad partition spec, ValidationException for stale offsets).
  • Full TestCoordinator suite passes locally (8/8).
  • spotlessApply passes.

Contributor

@Baunsgaard Baunsgaard left a comment

Good catch and cleanup.

However, the proposed error logging appears to double-log every commit failure: once here, and again in CoordinatorThread.run(). I have left some specific suggestions.

Comment on lines +157 to +162
    LOG.error(
        "Coordinator {} failed to commit for commit {}; propagating failure to terminate task",
        taskId,
        commitState.currentCommitId(),
        e);
    throw e;
Contributor

Change it to:

    throw new RuntimeException(
          String.format("Coordinator %s failed to commit %s",
              taskId, commitState.currentCommitId()),
          e);

This lets the catch further up in CoordinatorThread.run() log the error once while still attributing it to this location.
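A minimal sketch of the wrap-and-log-once idea, under stated assumptions: `WrapOnceSketch`, its `commit` and `run` methods, and the in-memory `LOG` list are hypothetical stand-ins for the Coordinator, CoordinatorThread.run(), and the real logger. The inner method adds context without logging, and only the outermost catch records the failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class WrapOnceSketch {

  /** Hypothetical in-memory log so the number of entries can be inspected. */
  static final List<String> LOG = new ArrayList<>();

  /** Wraps the failure with task and commit context; deliberately does not log. */
  static void commit(String taskId, UUID commitId) {
    try {
      // Simulated catalog failure (stand-in for CommitFailedException).
      throw new IllegalStateException("Glue detected concurrent update");
    } catch (Exception e) {
      throw new RuntimeException(
          String.format("Coordinator %s failed to commit %s", taskId, commitId), e);
    }
  }

  /** Stand-in for the top-level loop: the single place where the failure is logged. */
  static void run(String taskId, UUID commitId) {
    try {
      commit(taskId, commitId);
    } catch (Exception e) {
      LOG.add("ERROR " + e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
    }
  }

  public static void main(String[] args) {
    run("task-0", UUID.randomUUID());
    LOG.forEach(System.out::println);
  }
}
```

Because the wrapper message carries the task id and commit id, the one log entry keeps the context that the earlier LOG.error call provided.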

Contributor Author

Done, updated in the latest revision.

                ImmutableList.of(),
                EventTestUtil.now()))
        .isInstanceOf(CommitFailedException.class)
        .hasMessageContaining("Glue detected concurrent update");
Contributor

If you make the change above, then I think this needs to be:

        .hasRootCauseMessage("Glue detected concurrent update");
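To illustrate why the assertion changes, here is a plain-Java sketch (no AssertJ) of what a root-cause check looks at. `RootCauseSketch` and its `rootCause` helper are hypothetical, and the `IllegalStateException` stands in for CommitFailedException. Once the failure is wrapped, the outer message holds only the Coordinator context, so the original text is found by walking the cause chain.

```java
public class RootCauseSketch {

  /** Walks the cause chain to the innermost throwable. */
  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    while (current.getCause() != null && current.getCause() != current) {
      current = current.getCause();
    }
    return current;
  }

  public static void main(String[] args) {
    // Wrapper as suggested in the review; the task id and commit id are made up.
    RuntimeException wrapper =
        new RuntimeException(
            "Coordinator task-0 failed to commit c-1",
            new IllegalStateException("Glue detected concurrent update"));

    // The wrapper's own message does not include the cause's message.
    System.out.println(wrapper.getMessage().contains("Glue detected concurrent update"));

    // The root cause still carries the original message.
    System.out.println(rootCause(wrapper).getMessage());
  }
}
```

By the same reasoning, an instance check against the original exception type would presumably need to target the cause rather than the wrapper.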

Contributor Author

Done

@yadavay-amzn
Contributor Author

Thanks @Baunsgaard for taking a look, you're right about the double-logging.
I've pushed an update with your recommended changes, please take a look when you get a chance. Thanks!

Contributor

@Baunsgaard Baunsgaard left a comment

LGTM, left one nit for production code. Tests look fine!

Comment on lines +151 to +155
// Do not swallow commit failures: wrap with Coordinator context and propagate so
// CoordinatorThread.run() terminates and the Kafka Connect task transitions to FAILED
// instead of silently dropping data (e.g., CommitFailedException from catalogs that
// detect concurrent updates). The taskId and commitId are embedded in the wrapper
// message so that the single log emitted by CoordinatorThread retains the context.
Contributor

nit: I think it is too much to leave this comment. It is a personal preference, but I would remove it.

… them

The Coordinator previously caught all exceptions from doCommit() and only
logged a warning, causing the connector to stay RUNNING after a
CommitFailedException (e.g., Glue concurrent update) while silently
dropping data. Propagate the exception so CoordinatorThread terminates
and the Kafka Connect task transitions to FAILED.

Fixes apache#15878
@yadavay-amzn
Contributor Author

Done — removed the comment block.

Development

Successfully merging this pull request may close these issues.

[Kafka Connect] Connector enters silent broken state after CommitFailedException (Glue concurrent update) — no data written, no error surfaced

2 participants